Time - 60-120 minutes
To briefly recap, last week we:

- used group_by() to summarise data
- used ggplot() and identified some common plotting issues
- explored geoms and how they can be used to make different sorts of plots

Today we are going to:
So far you have learned to read in data, manipulate that data, and then do basic visualisation of that data. This week we are going to take the obvious next step and consider how we run statistical analyses in R.
The first point to make here is this is going to be a brief introduction to some useful statistical methods. R was originally developed as a statistical programming language (i.e. not primarily for data visualisation and manipulation) and consequently any analysis you can think of you can probably do in R (including implementing neural networks, MCMC Bayesian analyses, social network analysis, etc, etc).
We are going to focus on a single suite of modeling methods in-depth.
We will give you a brief refresher on the basics of frequentist statistics in the introductory lecture associated with this practical, however this will be a brief introduction (as this isn’t a statistics course!) so before you come to the session we suggest you re-familiarise yourself with:
If you are someone who hasn’t done much statistics previously, or if you haven’t come across the concepts listed above, then you will need to refresh/ensure you have a grounding in these concepts and terms before the start of this week’s practical, or you risk being rapidly left behind! Luckily, there are huge amounts of excellent video resources available on youtube discussing these concepts in depth, so spend some time watching these.
Generalised Linear Models (GLMs) are a critical and hugely useful technique for data analysis. They are Generalised Linear Models because they allow you to specify the error distribution of your model (e.g. gaussian, poisson, etc) and link function (e.g. identity, log, etc). This means that they are hugely flexible: you can fit them to count data, binary data, or data where the predictor variables are categorical or continuous. Critically, GLMs can be fit to data where the errors are normally or non-normally distributed. Linear regressions - which I am sure you are all familiar with - are a special case of the GLM, where the data have a gaussian (normal) distribution with an identity link (see below).
As we said above, GLMs are fabulously flexible - they can do ANOVA-style analyses with categorical predictors, linear regressions with continuous predictor variables, and mix both categorical and continuous predictors in multiple regression analyses. We will also consider the extension to GLMs where we can account for non-randomness between samples (so-called mixed models). GLMs are complex beasts, but are widely used to analyse biological data because of its often nested nature and lack of normality, and so we are going to cover this topic in-depth here rather than spread ourselves thinly across many different statistical techniques. We will touch on some other statistics in upcoming lectures.
The following two sections are a quick refresher on error distributions and link functions. Feel free to skip these if you are comfortable with the topics.
An error distribution specifies the structure of the error in the model - i.e. the variance in the data not explained by the predictor variables. Take the example of a gaussian (normal) distribution - we expect the residuals of the model to be normally distributed around the fitted model:
Here we have a linear model (blue horizontal line in the Fitted model panel) where there is no effect of a change in the x value on the y values (i.e. the slope of the blue line is ~0). The data are plotted in light blue, and the residual error (the difference between the predicted values [i.e. the blue horizontal line] and the observed values [i.e. the light blue dots]) is shown by the grey lines. You can see that the residuals are normally distributed around that fitted blue line (see histogram of residuals).
Remember, we are trying to minimise the residual error, as less error = a model which fits the data better. So we are looking for a roughly normal distribution of errors in our histogram of the residuals (☑), with no clear pattern in our fitted vs residuals plot (☑).
If we now fit a gaussian model to count data:
You will see that the residuals are very much not normally distributed (because we are working with counts, which are positive integers bounded at zero):
Error distributions allow us to deal with data which are non-normally distributed, or which don’t conform to the assumptions of the normal distribution (e.g. are integers). In the above example we would probably use a poisson distribution (specifically designed to cope with positive integer values).
For more information on the error families for glm’s have a look at the help - ?family.
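A quick sketch of how these family objects look in the console (the data here are simulated purely for illustration, so the exact fitted numbers will vary):

```r
## a family object bundles an error distribution with a link function
gaussian(link = "identity")   ## the default family/link for glm()
poisson(link = "log")         ## the default link for poisson errors

## simulated count data: the true trend is exp(0.05 * x)
set.seed(42)
dat <- data.frame(x = 1:50)
dat$y <- rpois(n = 50, lambda = exp(0.05 * dat$x))

## fit a poisson glm with a log link - the estimated slope
## should come out close to the true value of 0.05
glm(y ~ x, data = dat, family = poisson(link = "log"))
```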
Link functions - as the name suggests - link two things within our model: the data and the error distribution. The easiest way to think of them is as a transformation applied to the data before it is passed to the model. The link function relates the mean value of y to the linear predictor (x).
Again, taking the example of the normal distribution - the default link function is an “identity” link. This is basically saying that we don’t need to do any transformation to the data - we are assuming that its error follows the normal distribution.
If we look at the histogram above then the distribution is roughly normal (if we squint a bit) but now our fitted vs residuals plot shows a clear banana curve in the plotted points - we want to avoid this.
If we then fit this model with a log link:
You can see that it hasn’t changed our histogram of residuals much, but has changed the trend in the fitted vs residuals plot, which now looks good.
Note - looking for a normal distribution of residuals is useful when the residuals for a distribution are EXPECTED to be normally distributed. This means that residual plots like those above are useful for some error distribution families, but not all.
This is quite a complex topic, and one which can be hard to get your head around. If you aren’t comfortable with these issues then I suggest you do some reading to familiarise yourself with them. Here are a few places to get started:
Time - 60 minutes
So, let’s think about fitting a GLM to data. We are going to work on a few different data sets to demonstrate how GLMs can be used in different scenarios (and how flexible they are). First off we are going to briefly go through an example using a GLM to carry out an ANOVA-style analysis (looking to see whether there are differences between the mean values of different groups in the data). Then we will deep-dive into a more complex example later on.
For our simple first example we will analyse one of the data sets included in R - iris. So, let’s load the iris data set. As it is included in R already, we can simply do:
##load the iris data set
data("iris")
First things first, we always want to visualise our data. This is critical to understanding the structure of our data, and thus how the data should be analysed.
Task
Plot the Sepal.Width* data grouped (split) by species so we can eyeball whether there might be differences between the three species in the data. In the below plot I have “jittered” the data, which means we have artificially added variance to the data on the x-axis so we can see the points more easily. See if you can do this in your plot (hint - geom_ is your friend!)
*Note - a sepal is a part of a plant which acts to protect/support the flower.
## Plot the sepal widths so we can visualise if there are differences between the different species
ggplot(iris, aes(x=Species, y=Sepal.Width)) +
geom_jitter(aes(col=Species)) +
theme_bw()
Great, so it looks like there might be some differences between the species (particularly between setosa and the other two species). We want to test if these differences are down to chance (e.g. an artifact of sampling), or whether probability suggests that these differences are really there between the species. To do this we want to compare the variance and means of the Sepal.Width between the three species.
However, as we have plotted it above it is quite hard to discern what the distributions of the data are (e.g. are they normally distributed?), so we can do a better job at visualising the data.
Histograms are particularly useful for seeing whether data are normally distributed, and are (luckily) easy to do with ggplot.
Task
Tweak the following code to make a plot like the one below:ggplot(..., aes(x=..., fill = Species)) +
## bin width determines how coarse the histogram is
## the alpha determines the transparency of the bars
## position allows you to determine what kind of histogram you plot (e.g. stacked vs overlapping). try changing to position="stack"
geom_histogram(binwidth = .1, alpha = .5, position="identity")
ggplot(iris, aes(x=Sepal.Width, fill=Species)) +
geom_histogram(binwidth=.1, alpha=.5, position="identity")
So from the above we can visualise:
So we want to test whether there is any significant difference between the mean sepal widths of the three species. This means we have a continuous response (y) variable, and a categorical predictor (x) variable. This makes the coding for our GLM nice and simple. We have visually checked the data and there don’t appear to be any obvious issues with it, so we can forge ahead.
GLM is a base function in R, so we don’t need to load any packages. To run a GLM:
##fit a glm()
mod_iris <- glm(Sepal.Width ~ Species,
##specify the data
data = iris,
##specify the error structure
family = "gaussian")
Here we specify the formula of the model using ‘~’, where the first term is the dependent (y/response) variable, and the following term/terms is/are the independent (x/predictor) variables - i.e. the ones we believe should influence the value of y, but which aren’t influenced by the value of y. So Sepal.Width is our dependent variable, and Species is our independent variable.
Hint - you can read ~ in R as “as a function of”. So in this case, we are testing whether the Sepal.Width changes as a function of species
We have also specified the “family” to be “gaussian” (actually we didn’t need to do this as this is the default setting, but it’s good practice to make sure we understand what exactly we are doing). Family here means the error structure, so we are assuming (initially) that our residuals are normally distributed; this means that - given our current settings - we are actually just fitting an ANOVA to the data.
Statistics is as much art as science, and assessing how well a model fits data can be complex and time consuming. There isn’t an automated or best approach for doing it.
Remember:
Luckily R has some nifty in-built functions for helping us. We will start off using some of the base methods to do this, and then move onto some more complex ways later today and next week.
Once we have coded and run the model we have an object (above called mod_iris):
##display the class of the model object
class(mod_iris)
## [1] "glm" "lm"
You can see this is a special class, a GLM and LM (LM just means linear regression). This class means that R recognises what this is (a model) and can help us to analyse this object. For example, if we run the following, we get a series of diagnostic plots which help us interpret how well our model fits our data:
##plot the diagnostic graphics for the model
plot(mod_iris)
Briefly, for the above plots:
We will go more into assessing the fits of a model later on in this session and next week, but for now it looks like this simple model fits the data well and we can be pretty confident in it. In fact, this is about as good a fit as you are going to get when fitting a model to real data!
So we have fit our model, and ensured that our model fits the data well. Now we want to look at what our model says about the data.
First off let’s look at the model output, using the summary() function:
##summarise the model outputs
summary(mod_iris)
##
## Call:
## glm(formula = Sepal.Width ~ Species, family = "gaussian", data = iris)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.128 -0.228 0.026 0.226 0.972
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.42800 0.04804 71.359 < 2e-16 ***
## Speciesversicolor -0.65800 0.06794 -9.685 < 2e-16 ***
## Speciesvirginica -0.45400 0.06794 -6.683 4.54e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1153878)
##
## Null deviance: 28.307 on 149 degrees of freedom
## Residual deviance: 16.962 on 147 degrees of freedom
## AIC: 106.73
##
## Number of Fisher Scoring iterations: 2
You can see this gives us further details on the model. The important bits for us at this stage are the coefficients.
Note - for a full discussion of the output see the excellent posts here and here.
We can see there is a significant negative effect of both Speciesversicolor and Speciesvirginica. But what is this measured against? And where is the third species, setosa?! Well, R takes the first (alphabetically in our case as these are categorical predictors) value of our predictor variable and uses this as our comparison point. This is called the intercept in the model output.
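If you want a different species as the comparison point, you can change the reference level of the factor with relevel() and refit the model. A quick sketch (here we arbitrarily pick versicolor as the new reference, and work on a copy so we don't disturb the original iris object):

```r
## work on a copy so we don't change the factor ordering in iris itself
iris2 <- iris
iris2$Species <- relevel(iris2$Species, ref = "versicolor")

## refit - setosa and virginica are now compared against versicolor
mod_iris2 <- glm(Sepal.Width ~ Species, data = iris2, family = "gaussian")
summary(mod_iris2)
```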
So, to read our output table:
The estimate value for the intercept is the estimate of the mean value of the Sepal.Width for our intercept species (setosa) - 3.428. If we look at our first plot (and here I’ve added a crossed circle at 3.428):
We can see that that looks about right.
Our values for our other two species are relative to this mean. So to calculate the mean value for versicolor we need to add its Estimate (from the above table) to the Intercept:
3.428 + -0.65800
## [1] 2.77
And the same for virginica:
3.428 + -0.45400
## [1] 2.974
Again, eyeballing the graph above corroborates these estimated mean values for each species. Great!
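We can double-check this arithmetic by asking R for the group means directly (this isn't part of the modelling, just a sanity check):

```r
## mean Sepal.Width per species - these should match
## the intercept plus the relevant estimates
aggregate(Sepal.Width ~ Species, data = iris, FUN = mean)
##      Species Sepal.Width
## 1     setosa       3.428
## 2 versicolor       2.770
## 3  virginica       2.974
```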
We can also see there is a significant difference in all three values in our table (all values in the Pr(>|t|) column are <0.05 - i.e. there is a less than 5% probability this pattern occurred by chance).
For the Intercept value, we interpret this as the intercept Estimate being significantly different from 0 (not very important in our case).
For the other two values, our interpretation is that they are significantly different from our intercept (the mean value for setosa).
Question
Is the mean value of versicolor significantly different from the mean value for virginica?
The simple answer to the above question is…we don’t know. Or at least, we can’t tell from this analysis. All we are doing here is testing for differences from the intercept (setosa). To discern whether there are differences between the other species we need to do some further analysis.
To find out whether there are significant differences between our groups we need to do a multiple comparisons test (i.e. compare the differences between all of the categories in our predictor variable). Luckily, this is easy to do in R:
## load the multcomp pack
library(multcomp)
## run the multiple comparisons, and look at the summary output:
summary(glht(mod_iris, mcp(Species="Tukey")))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: glm(formula = Sepal.Width ~ Species, family = "gaussian", data = iris)
##
## Linear Hypotheses:
## Estimate Std. Error z value Pr(>|z|)
## versicolor - setosa == 0 -0.65800 0.06794 -9.685 < 0.001 ***
## virginica - setosa == 0 -0.45400 0.06794 -6.683 < 0.001 ***
## virginica - versicolor == 0 0.20400 0.06794 3.003 0.00754 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
Question
Using your newly acquired knowledge of how to read model summaries in R: which of the species are significantly different from one another?
Time - 80 minutes
So, now you know how to fit a simple GLM. Next, we will work on a single population time series from the species data you plotted for homework last week, and see whether there has been a significant increase, decrease, or no change in that population over time.
We will go through the process of fitting a GLM to the data, checking the fit and other potential models, and then think about plotting the outputs of the model.
Task
Load in the species data you used last week from the Bioinformatics Github repository (using vroom and the url to the data), and then reshape your data into long format. This is the code you wrote for your homework in week 3. However we aren’t interested in any values which have missing data, so we want to drop NA values during the pivot from wide to long. You will need to modify the code you wrote for your homework to do this.
NOTE - Most of the following tasks are simple modifications or reimplementations of code you have already written.
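If you are stuck, the general shape of the code is something like the following. The URL here is a placeholder (use the one from last week's homework), and the column names you exclude from the pivot will depend on your data's layout - the key new bit is the values_drop_na argument:

```r
library(vroom)
library(dplyr)
library(tidyr)

## read the data straight from the repository
## (placeholder url - substitute the real one from your homework)
species_data <- vroom("https://raw.githubusercontent.com/your-repo/species.csv")

## pivot from wide to long, dropping the NA values as we go
species_long <- species_data %>%
  pivot_longer(cols = -c(species, primary_threat, secondary_threat,
                         tertiary_threat, population),
               names_to = "date",
               values_to = "abundance",
               values_drop_na = TRUE)
```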
First off, let’s take a quick look at this big data set to refresh our memory on the structure.
Task
Use the print() function to look at the data.
## # A tibble: 696 × 7
## species primary_threat secondary_threat tertiary_threat population date
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Schistidiu… Habitat destru… <NA> <NA> pop_1 2013…
## 2 Schistidiu… Habitat destru… <NA> <NA> pop_1 2014…
## 3 Schistidiu… Habitat destru… <NA> <NA> pop_1 2015…
## 4 Schistidiu… Habitat destru… <NA> <NA> pop_1 2016…
## 5 Schistidiu… Habitat destru… <NA> <NA> pop_1 2017…
## 6 Schistidiu… Habitat destru… <NA> <NA> pop_1 2018…
## 7 Schistidiu… Habitat destru… <NA> <NA> pop_1 2019…
## 8 Paraleucob… Exploitation Habitat loss <NA> pop_1 2005…
## 9 Paraleucob… Exploitation Habitat loss <NA> pop_1 2006…
## 10 Paraleucob… Exploitation Habitat loss <NA> pop_1 2007…
## # … with 686 more rows, and 1 more variable: abundance <dbl>
If we look at this print out we can see that the date column isn’t specified in the date format, so let’s set that to a date first.
Task
Set the date column to date format.
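One way to do this is with as.Date() inside a mutate() - a sketch, assuming your dates are stored in a standard year-month-day text format (check with print() first), and using species_long to stand in for whatever you called your long-format data:

```r
library(dplyr)

## convert the date column from character to Date format
species_long <- species_long %>%
  mutate(date = as.Date(date))
```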
So, the important bits for us (remember we want to know if there has been a significant change in the population over time) are going to be the date, abundance, and species columns for the moment.
As we said, we are going to use a single time series, so let’s extract the time series - using filter() for Trichocolea tomentella from the species data we loaded in.
Task
Filter the data to make a new data frame called single_spp containing only data on Trichocolea tomentella
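A sketch of the filtering step (again, species_long stands in for your long-format data frame):

```r
library(dplyr)

## keep only the rows for Trichocolea tomentella
single_spp <- species_long %>%
  filter(species == "Trichocolea tomentella")
```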
## # A tibble: 18 × 7
## species primary_threat secondary_threat tertiary_threat population date
## <chr> <chr> <chr> <chr> <chr> <date>
## 1 Tricho… Climate change <NA> <NA> pop_1 2003-01-01
## 2 Tricho… Climate change <NA> <NA> pop_1 2004-01-01
## 3 Tricho… Climate change <NA> <NA> pop_1 2005-01-01
## 4 Tricho… Climate change <NA> <NA> pop_1 2006-01-01
## 5 Tricho… Climate change <NA> <NA> pop_1 2007-01-01
## 6 Tricho… Climate change <NA> <NA> pop_1 2008-01-01
## 7 Tricho… Climate change <NA> <NA> pop_1 2009-01-01
## 8 Tricho… Climate change <NA> <NA> pop_1 2010-01-01
## 9 Tricho… Climate change <NA> <NA> pop_1 2011-01-01
## 10 Tricho… Climate change <NA> <NA> pop_1 2012-01-01
## 11 Tricho… Climate change <NA> <NA> pop_1 2013-01-01
## 12 Tricho… Climate change <NA> <NA> pop_1 2014-01-01
## 13 Tricho… Climate change <NA> <NA> pop_1 2015-01-01
## 14 Tricho… Climate change <NA> <NA> pop_1 2016-01-01
## 15 Tricho… Climate change <NA> <NA> pop_1 2000-01-01
## 16 Tricho… Climate change <NA> <NA> pop_1 2001-01-01
## 17 Tricho… Climate change <NA> <NA> pop_1 2002-01-01
## 18 Tricho… Climate change <NA> <NA> pop_1 1999-01-01
## # … with 1 more variable: abundance <dbl>
And, as always, we should visualise the data before we do any analysis. We will do this using ggplot() and fit a loess smoothing line so we can visualise any trends in the data:
##make the plot
p1 <- ggplot(single_spp, aes(x=date, y=abundance)) +
geom_point() +
geom_line() +
theme_bw() +
ylab("Abundance") +
xlab("Year")
##add the loess smoothing:
p1 + geom_smooth(method="loess")
Ok, so I think it’s pretty clear that the population is declining! But we want to know how fast that population is changing - so we should fit a model to these data.
So, we know how to run a GLM in R (you’ve just done it), but one thing to consider is the x-axis on the plot above - R will accept date as a predictor variable (make sure it is set as a date!) but it is often better to replace the date with a numeric vector as it is easier to interpret (see below in the Model summaries and outputs section) - we are going to do this with the code below:
## calculate a new column (`standardised_time`) which is the difference between the
## starting date of the time series and each other date in weeks (see ?difftime)
## we will set this to a numeric vector
single_spp <- single_spp %>%
mutate(standardised_time = as.numeric(difftime(as.Date(date),
min(as.Date(date)),
units = "weeks")))
print(single_spp[,c("abundance", "date", "standardised_time")])
## # A tibble: 18 × 3
## abundance date standardised_time
## <dbl> <date> <dbl>
## 1 245 2003-01-01 209.
## 2 231 2004-01-01 261.
## 3 125 2005-01-01 313.
## 4 121 2006-01-01 365.
## 5 125 2007-01-01 417.
## 6 87 2008-01-01 470.
## 7 79 2009-01-01 522.
## 8 97 2010-01-01 574
## 9 42 2011-01-01 626.
## 10 35 2012-01-01 678.
## 11 41 2013-01-01 731.
## 12 49 2014-01-01 783.
## 13 55 2015-01-01 835.
## 14 40 2016-01-01 887
## 15 336 2000-01-01 52.1
## 16 212 2001-01-01 104.
## 17 204 2002-01-01 157.
## 18 368 1999-01-01 0
Now to fit our glm:
##fit a glm()
mod1 <- glm(abundance ~ standardised_time,
data = single_spp,
family = "gaussian")
As above we have specified our model, but here the independent variable is standardised_time and the dependent variable is abundance. And again, we have specified the “family” to be “gaussian”.
So we did a very brief assessment of the model fit in the first example, but in reality we are going to want to spend more time and energy digging to find out how well our model really does fit the data. This is especially important as the models become more complex.
ONE THING TO NOTE - you are never going to get a model which perfectly fits your data! (hence the “all models are wrong…”). We need to get a model which fits as best we can, so that any interpretations of the data are (with the highest likelihood) correct.
First off, let’s produce plots like those in the pre-session reading to visualise the model fit against standardised_time and the residuals, along with a few other tools for assessing how well the model fits.
To do this we need to extract some values from the model, specifically:
We can do this using the predict() and resid() functions.
Note - when we pass the model to the predict function it will return the predicted y values for each x value in the data we used to fit the model. i.e. if our predictor variable is c(1, 2, 3, 4, 5) it will return the predicted y values for each of these x values in the same order. This means we can simply add these predicted values straight back onto the data we fit the model to (in a new column).
##return the predicted (response) values from the model
##and add them to the single species tibble:
single_spp$pred_gaussian <- predict(mod1,
type="response")
Task
Use the resid() function to add the residuals of the model to the data.frame in the same way as you added the predicted values.
##return the model residuals and add to the single species tibble:
single_spp$resid_gaussian <- resid(mod1)
Now we’ve added these data to our original data frame we can start to plot some of the plots in the style of the Error distributions section above.
First off let’s plot the data again, and add in the predicted values from the model as a line:
## plot the abundances through time
p2 <- ggplot(single_spp, aes(x = standardised_time,
y = abundance)) +
geom_point() +
geom_line() +
theme_bw() +
ylab("Abundance") +
xlab("standardised_time")
##add in a line of the predicted values from the model
p2 <- p2 + geom_line(aes(x = standardised_time,
y = pred_gaussian),
col = "dodgerblue",
size = 1)
## we can also add in vertical blue lines which show the redsidual error of the model
## (how far the observed points are from the predicted values).
## in geom_segement we specify where we want the start and end of the segments (lines)
## to be. Without any prompting ggplot assumes that we want the start of the lines to
## be taken from the x and y values we are plotting using the ggplot() function
## i.e. standardised_time and abundance, so we just need to specify the end points of
## the lines:
p2 <- p2 +
geom_segment(aes(xend = standardised_time,
yend = pred_gaussian),
col="lightblue")
## add a title
p2 <- p2 + ggtitle("Fitted model (gaussian with identity link)")
##print the plot
p2
Eye-balling this it looks like a pretty good fit, which is nice.
Next, let’s check the residuals around the fitted values:
##plot a histogram of the residuals from the model using geom_histogram()
p3 <- ggplot(single_spp, aes(x = resid_gaussian)) +
geom_histogram(fill="goldenrod") +
theme_minimal() +
ggtitle("Histogram of residuals (gaussian with identity link)")
## print the plot
p3
The residuals aren’t what you would describe as “normally distributed” which is a concern.
Let’s look at how the residuals change with the predicted values.
Task
Try to make the plot below using your own code, plotting the predicted vs residuals in the single_spp data.frame
##make the plot of predicted vs residuals
p4 <- ggplot(single_spp,
aes(x = pred_gaussian,
y = resid_gaussian)) +
geom_point() +
theme_minimal() +
xlab("Predicted values") +
ylab("residuals") +
ggtitle("Predicted vs residual (gaussian with identity link)") +
##using geom_smooth without specifying the method (see later) means geom_smooth()
##will try a smoothing function with a formula y~x and will try to use a loess smoothing
##or a GAM (generalised additive model) smoothing depending on the number of data points
geom_smooth(fill="lightblue", col="dodgerblue")
##print it
p4
Your plot definitely shows a banana shape, not the expected straight line. Actually we could probably have predicted this - if we look at the first plot we can see that at standardised_time 0 and standardised_time > 750 the line consistently underpredicts the observed values, and hence we have higher residual errors - reflected in the banana shape curve you see above.
Let’s look at one last diagnostic plot, the Q-Q plot. QQ plots appeared earlier on when we used the plot() function on our model, and they are incredibly useful for interpreting how well models fit data (more info on them here). R has a helpful function to make these plots automatically:
##plot the qq plot for the residuals from the model assuming a normal distribution,
## and add the straight line the points should fall along:
qqnorm(single_spp$resid_gaussian); qqline(single_spp$resid_gaussian)
We can see that the points also deviate from the expected straight line. QQ plots can give us quite a lot of information about the discrepancies in the data when compared to the theoretical distribution (in this case the gaussian) we have assumed describes the residuals in the model.
Task
Take 5 minutes to have a look at this application. Explore how the different skewness and tailedness in the data affect the QQ plots, and then decide what kind of skewness/tailedness we might have in our data plotted above. What sort of skew does our data have?
Using this application it seems our data is likely to be negatively skewed. Let’s keep that in mind for later.
So what are our conclusions from this preliminary investigation? Well, we can see that the model does an OK (but not great) job of fitting to the data - we need to investigate some other possible models to see if we can identify one which does a better job.
Remember - Generally speaking, we are trying to minimise the residual error in the above plots. A model with less residual error fits our data better.
So the next thing to do is fit some alternative models to the data to see if they do a better job.
The data are counts (abundance of species at each standardised_time) so one very sensible option is a glm with poisson error distribution.
## fit a glm with a poisson distribution
mod2 <- glm(abundance ~ standardised_time,
data = single_spp,
family = "poisson")
Another option, given the skewness in the data we explored above, would be a gaussian glm with a log link, which might help solve our skewness issue…
## fit a glm with a gaussian distribution with a log link
mod3 <- glm(abundance ~ standardised_time,
data = single_spp,
family = gaussian(link = "log"))
We could also try a gaussian model with an inverse link:
## we could also try a gaussian model with an inverse link
mod4 <- glm(abundance ~ standardised_time,
data = single_spp,
family = gaussian(link = "inverse"))
Ok, so now we have four potential models (mod1, mod2, mod3, mod4). We can compare the fits of these models to the data using the Akaike information criterion (AIC). Actually, because our time series is short (only 18 data points), we should use the small-sample corrected version (AICc), which penalises the number of estimated parameters more strongly when sample sizes are small. With larger data sets (relative to the number of parameters being estimated) plain AIC would be appropriate. So, to calculate AICc:
##install the gamlr package, and then:
library(gamlr)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##compare the models
AIC_mods <- data.frame(model = c("mod1", "mod2", "mod3", "mod4"),
AICc = c(AICc(mod1), AICc(mod2), AICc(mod3), AICc(mod4)))
## rank them by AIC using the order() function
AIC_mods[order(AIC_mods$AICc),]
## model AICc
## 3 mod3 177.1444
## 4 mod4 185.2119
## 1 mod1 192.9506
## 2 mod2 211.2472
Task
Try to work out how the order() function works - and why we use it in the square brackets for AIC_mods. To do this try running chunks of the code above in isolation.
Great. So it looks like the gaussian model with a log link (mod3) fits the data the best (has the lowest AICc). Conveniently none of the other models are close (within 2 AICc units) to the best fitting model, so we can concentrate on mod3 moving forward.
Whilst AICc tells us about the comparative fits of the different models, it doesn’t tell us how well the models actually fit the data. They could all just fit it really badly, and mod3 just fits it least badly! So we need to go back and check the fits of the model again.
Task
Produce the plot below, adapting the code we developed earlier to produce the plot with the fits and residuals for a gaussian distribution with an identity link.
Hint - you can copy and paste most of the code to do this
##return the predicted (response) values from the model and add them to the single species tibble:
single_spp$pred_gaussian_log <- predict(mod3,
type="response")
##return the model residuals and add to the single species tibble:
single_spp$resid_gaussian_log <- resid(mod3)
##first off let's plot the data again, and add in the predicted values from the model as a line. We can modify the plot we started earlier:
p5 <- ggplot(single_spp, aes(x=standardised_time, y=abundance)) +
geom_point() +
geom_line() +
theme_bw() +
ylab("Abundance") +
xlab("standardised_time")
##add in a line of the predicted values from the model
p5 <- p5 + geom_line(aes(x = standardised_time,
y = pred_gaussian_log),
col = "dodgerblue",
size = 1)
## we can also add in lines showing the distance of each observation from
## the value predicted by the model (i.e. these lines visualise the residual error)
p5 <- p5 + geom_segment(aes(xend = standardised_time,
yend = pred_gaussian_log),
col="lightblue")
## add a title
p5 <- p5 + ggtitle("Fitted model (gaussian with log link)")
##print the plot
p5
That looks like a good fit!
Let’s have a look at some of the other measures:
##plot the diagnostic graphics for model 3
plot(mod3)
Briefly, for the above plots:
In all, we have a model which fits the data well and which we can be pretty confident in.
Note - for a more in-depth discussion of how to interpret the above diagnostic plots, see the excellent blog post here.
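As a quick sketch, you can also check the normality of the residuals directly with `qqnorm()` and `qqline()` (both covered earlier) - if the points fall roughly along the reference line, the residuals are approximately normally distributed:

```r
## extract the residuals from the fitted model
model_residuals <- resid(mod3)

## QQ plot: compares the quantiles of the residuals against
## the quantiles of a theoretical normal distribution
qqnorm(model_residuals)

## add the reference line - points lying close to this line suggest
## approximately normally distributed residuals
qqline(model_residuals)
```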
One thing we should bear in mind is that the gaussian distribution is not specifically formulated for count data - it assumes the errors are continuous, whereas we know that isn't the case with our data. This can be a cause for concern, and whilst strictly we shouldn't do this (as it violates the assumptions of the normal distribution), we can usually get away with it when the mean of the dependent variable is high (as a rule of thumb > 50, which is the case here).
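If you did want a distribution specifically designed for counts, a Poisson error structure is the usual starting point. A hedged sketch (this model is not part of the practical's comparison set; note too that real count data are often overdispersed, in which case a quasi-Poisson or negative binomial model is more appropriate):

```r
## fit a poisson GLM with a log link - the canonical choice for count data
mod_pois <- glm(abundance ~ standardised_time,
                data = single_spp,
                family = poisson(link = "log"))

## check for overdispersion: a residual deviance much larger than the
## residual degrees of freedom suggests a quasipoisson or negative
## binomial model would be a better choice
summary(mod_pois)
```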
So we have fit our model, and ensured that our model fits the data well. Now we want to look at what our model says about the data.
First off let’s look at the model output, using the summary() function:
##summarise the model outputs
summary(mod3)
##
## Call:
## glm(formula = abundance ~ standardised_time, family = gaussian(link = "log"),
## data = single_spp)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -58.133 -18.337 0.437 16.949 55.486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.8867705 0.0534321 110.2 < 2e-16 ***
## standardised_time -0.0027565 0.0002377 -11.6 3.38e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 806.5102)
##
## Null deviance: 184068 on 17 degrees of freedom
## Residual deviance: 12904 on 16 degrees of freedom
## AIC: 175.43
##
## Number of Fisher Scoring iterations: 4
Question
Using your newly acquired knowledge of how to read model summaries in R:
is there a significant change in the population over time?
We can see there is a significant negative effect of standardised_time on the dependent variable (abundance). It's now easy to see why we converted the date to standardised_time: with a continuous numeric predictor (the number of weeks since the first recorded abundance) the output is much easier to interpret. Note that because the model uses a log link, the coefficient is on the log scale: abundance changes by a factor of exp(-0.0028) ≈ 0.997 per week, i.e. a decline of roughly 0.3% per week over the time series.
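Because mod3 uses a log link, the fitted coefficients sit on the log scale; exponentiating them gives the multiplicative change in abundance per unit of the predictor. A minimal sketch:

```r
## coefficients are on the log scale when using a log link
coef(mod3)

## exponentiate to get multiplicative effects on the response scale:
## a value just below 1 for standardised_time means abundance shrinks
## slightly (multiplicatively) with each additional week
exp(coef(mod3))
```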
We can plot the output of this model, which is great to visualize how well it fits the data and is ideal for things like publications and to communicate the results of your analyses:
## first off let's plot the data again
p6 <- ggplot(single_spp, aes(x=standardised_time, y=abundance)) +
geom_point() +
geom_line() +
theme_bw() +
ylab("Abundance") +
xlab("standardised_time")
## use the geom_smooth() function to add the regression to the plot.
## unlike earlier here we are specifying the model type (glm), the formula,
## and the error structure and link
p6 <- p6 + geom_smooth(data=single_spp,
method="glm",
method.args = list(family = gaussian(link="log")),
formula = y ~ x,
col = "dodgerblue",
fill = "lightblue")
##print the plot
p6
You can see that ggplot() conveniently calculates the confidence intervals around the line, giving us a nice visualization of the fitted model.
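If you need the confidence interval values themselves (rather than geom_smooth()'s shaded ribbon), you can compute them from the model. A sketch using `predict()` with `se.fit = TRUE` - note the standard errors come back on the link (log) scale, so we build the interval there and then back-transform with `exp()`:

```r
## predictions and standard errors on the link (log) scale
pred_link <- predict(mod3, se.fit = TRUE, type = "link")

## approximate 95% confidence interval on the link scale,
## then back-transform to the response scale with exp()
single_spp$ci_lower <- exp(pred_link$fit - 1.96 * pred_link$se.fit)
single_spp$ci_upper <- exp(pred_link$fit + 1.96 * pred_link$se.fit)
```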
Let’s quickly summarise what we have learned today:
Phew! That's a lot to cover in one week. Don't worry if you didn't manage to work through everything, or if you found some parts difficult - you can always come back and refer to this at any time.
| Function/operator | Call |
|---|---|
| Apply smoothing visualisation to ggplot() | geom_smooth() |
| Calculate difference between dates | difftime() |
| Run a GLM model | glm() |
| Return predicted values from a model | predict() |
| Return the residuals of a glm | resid() |
| Make a QQ plot | qqnorm() |
| Add a QQ line to a QQ plot | qqline() |
| Compare models using AICc | AICc() |
| Summarise a model | summary() |
| Order the items in a column | order() |
First of all, there isn't a right or a wrong way of doing last week's homework. Actually, this is usually the case with R - there are better ways of doing things (easier-to-read code, fewer mistakes, faster) and worse ways (the opposite of all of those things), but not necessarily a right and a wrong way. Your homework was about thinking outside the box, being creative, investigating the data, and trying to come up with an elegant visualisation. Here is one option, working from the long format of the data you generated last week:
## make a new column in long_spp which is the number of threats recorded for each population
##this is the length of the non-NA values in each row of the columns primary_threat, secondary_threat, and tertiary_threat:
long_spp <- long_spp %>%
rowwise() %>%
mutate(n.threats = length(which(!is.na(c(primary_threat,
secondary_threat,
tertiary_threat)))))
##load in the scales package for the pseudo_log transformation
library(scales)
##build the ggplot, setting the date to a date format
ggplot(long_spp, aes(x=as.Date(date),
y=abundance,
col=n.threats))+
##set the theme to minimal
theme_minimal() +
##position the legend at the top on the left
theme(legend.position = "top",
legend.justification="left") +
## change the automatic label for the legend to the following
## note it is done twice (once for fill and once for col) as
## we change the colour of the lines and the fill of the smooth according to the number of threats
labs(col="Number of threats", fill="Number of threats") +
## add in the lines of the population, using the group and interaction functions
## to specify how the data are grouped and get 1 line per population
geom_line(aes(group = interaction(population,
species,
n.threats)),
alpha=0.2,
size=1) +
## add in a smoothed line across all of the population time series for each number of threats
## this visualises the mean change across the populations
geom_smooth(aes(group = n.threats,
fill = n.threats), alpha=0.3, size=1) +
##change the colours to reflect the number of threats, using a gradient
scale_color_gradient(low = "#745745",
high = "#d14124") +
scale_fill_gradient(low = "#745745",
high = "#d14124") +
## transform the y axis using a pseudo_log transformation
## (as we have some 0 counts and you can't log 0)
scale_y_continuous(trans="pseudo_log") +
##change the x and y axis labels
ylab("Population abundance") +
xlab("Year")
So here we have thought a little outside the box and looked at how the number of stressors that a population is threatened by alters its trajectory through time - it seems that, on average, the more threats a population faces the faster it is declining.
How you visualise your data will of course depend on exactly the questions you are interested in. Plotting as above gives the advantage of being able to directly compare between the different groups (0, 1, 2, 3 stressors) but at the expense of the species-level data.
A couple of other neat tricks are worth highlighting:
This week's homework is split into two parts. Make an Rmarkdown document containing both of the homework solutions (separated into two parts so it is easy to read).
For the first part of this week's homework you will be testing whether there is a significant effect of a country's GDP on its position in the Tokyo Olympics medal table (cynical, I know!). The medal table is available in the Bioinformatics Data GitHub repository here. You will need to:
To build on the work you have done above this week’s homework will be to further analyse the iris data you looked at in Section 2. We started off by analysing this to see if there were differences in the mean Sepal.Width between the different species, for homework you are going to see whether there is a positive effect of Petal.Length on Petal.Width, and whether this differs amongst the three species.
To do this you will need to:
log10(), sqrt(), etc)
Petal.Length on Petal.Width? does this change between species?).
To do this you will need to go slightly beyond where we left the practical today - you will need to fit a model which includes both Petal.Length AND Species as predictor variables. We will go into this more next week, but for now what you need to know is that this can be achieved by using the * operator: Petal.Length*Species, instead of including a single predictor as we did here. The * operator says "include the effects of Petal.Length and Species, and the interaction between Petal.Length and Species, in my model". This will allow you to discern:
whether there is an overall effect of Petal.Length on Petal.Width
whether there is an overall effect of Species on Petal.Width
whether there are differences in the effect of Petal.Length on Petal.Width between different species (i.e. does the effect of Petal.Length on Petal.Width change between different species).
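To make the formula syntax concrete, the * operator is shorthand for the main effects plus their interaction - the two model calls below are equivalent (sketched here with the built-in iris data named in the homework; fitting the model is still your task!):

```r
## these two formulas fit identical models:
## '*' expands to the main effects plus the interaction term (':')
mod_a <- glm(Petal.Width ~ Petal.Length * Species, data = iris)
mod_b <- glm(Petal.Width ~ Petal.Length + Species + Petal.Length:Species,
             data = iris)

## both should report the same AIC, confirming they are the same model
AIC(mod_a)
AIC(mod_b)
```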